Detecting cell-of-origin and cancer-specific methylation features of cell-free DNA from Nanopore sequencing

The Oxford Nanopore (ONT) platform provides portable and rapid genome sequencing, and its ability to natively profile DNA methylation without complex sample processing is attractive for point-of-care real-time sequencing. We recently demonstrated ONT shallow whole-genome sequencing to detect copy number alterations (CNAs) from the circulating tumor DNA (ctDNA) of cancer patients. Here, we show that cell type and cancer-specific methylation changes can also be detected, as well as cancer-associated fragmentation signatures. This feasibility study suggests that ONT shallow WGS could be a powerful tool for liquid biopsy. Graphical Abstract Supplementary Information The online version contains supplementary material available at 10.1186/s13059-022-02710-1.


Background
Circulating cell-free DNA (cfDNA) can reveal informative features of its tissue of origin, including somatic genome alterations, DNA modifications, and cell type-specific fragmentation patterns [1]. DNA methylation is a promising cfDNA biomarker and is in widespread testing as a cancer screening tool [2]. DNA methylation can also be used to detect the turnover of damaged cells in time-sensitive conditions, such as myocardial infarction, sepsis, and COVID-19 [3][4][5][6]. These studies used bisulfite-based approaches to profile methylation, but alternative approaches include immunoprecipitation [7] and enzymatic conversion [8] techniques.
Accurate calling of DNA methylation from native DNA Oxford Nanopore (ONT) sequencing has matured and now produces single base-pair resolution results highly similar to bisulfite sequencing [9,10]. The ONT platform is portable and has a low cost of setup and a rapid sequencing workflow that can enable real-time medicine [11,12]. Native ONT sequencing requires no complex sample processing steps and no PCR amplification, making it attractive for clinical tests. Bisulfite approaches, in particular, involve significant degradation and loss of input material. For these reasons, whole-genome sequencing (WGS) using the ONT platform is appealing relative to other whole-genome approaches.
Because of its low cost, targeted bisulfite PCR (including multiplexed NGS versions [13]) is also popular for clinical methylation sequencing, and these would not require the native modification calling capability of Nanopore. However, shallow whole-genome approaches that capture multiple genomic features could potentially be more informative, especially with regard to the course of the disease [8,14,15].
ONT has primarily been used for long-read sequencing, but recent work by our group and others has shown that it can be adapted for short fragments without additional processing steps [16][17][18]. As an added benefit, the ability to capture much longer cfDNA fragments than short-read sequencing may lead to new discoveries or biomarkers, as was demonstrated recently in the case of longer fragments during pregnancy [19].
Here, we perform a feasibility study of ONT sequencing for circulating tumor DNA (ctDNA) detection by comparing methylation and several fragmentation features to matched Illumina samples and comparable Illumina-based datasets.

Estimating cell type fractions from cfNano
We first performed cell type deconvolution of healthy plasma cfDNA using DNA methylation data from either published WGBS datasets or our cfNano samples. For the external WGBS datasets, we used the methylation fractions (beta values) that were provided in the published data files. For our cfNano, we performed direct modification calling using the Megalodon software provided by ONT (https:// github. com/ nanop orete ch/ megal odon). To perform deconvolution, we used 1000-2000 marker CpGs per cell type based on a previously published atlas of purified cell types ("MethAtlas" [5,13]) and estimated cell type fractions using non-negative least squares (NNLS) regression as described in [5]. In order to better understand the impact of the relatively low sequencing depth of our cfNano samples (~ 0.2× genome coverage), we first performed deconvolution of all samples using downsampling experiments starting with full sequence depth down to 0.0001× genome coverage ( Fig. 1A and Additional file 1: Figs. S1-S3). Healthy plasma WGBS samples were taken from a recent study of 50-100× genomic coverage [13] (Fig. 1A, left "Fox-Fisher" samples) and another WGBS study with 0.5-1× coverage [20] (Fig. 1A, middle "Nguyen" samples). Finally, healthy cfNano samples were analyzed (Fig. 1A, right "this study"). From full depth down to 0.2× (about 2.5M aligned fragments), all samples were dominated by the expected cell types: monocytes, lymphocytes, megakaryocytes, neutrophils/granulocytes, and sometimes hepatocytes [5]. Cell type proportions became significantly degraded at 0.05× coverage and below (corresponding to less than 700,000 aligned fragments). The common cell types were consistently found across the 23 healthy individuals in the Fox-Fisher dataset, the 3 healthy individuals in the Nguyen dataset, and the 7 healthy individuals in the cfNano dataset, both at full depth (Fig. 1B) and when downsampled to 0.2× depth (Fig. 1C). The same was found when cell type groups, such as lymphocytes, were broken down into individual cell types (Additional file 1: Fig. S4A-C). Notably, a slight epithelial fraction was identified in some of the Fox-Fisher samples at 0.2×, which did not appear at full 80× depth, suggesting a small but measurable amount of noise at the 0.2× coverage level.
The healthy cfNano individuals were divided into two groups based on the source site, with one being collected and sequenced at ISPRO Italy ("BC" samples) and one in Israel ("HU" samples). Despite the HU samples being lower coverage (two were between 0.10-0.15× depth), they displayed relatively similar cell type proportions (Fig. 1B, C and Additional file 1: Fig. S3). In addition to healthy individuals, the Nguyen WGBS dataset and our cfNano dataset also contained individuals being treated for lung adenocarcinoma, marked as "LuAd" in Fig. 1B, C. In the Nguyen WGBS study, samples were collected at the time of acquired resistance to EGFR inhibitors and were divided into those that acquired resistance mutations in EGFR itself (labeled "on-target") vs. those that acquired amplifications in alternative oncogenes MET/ERBB2 (labeled "off-target"). The epithelial cell fraction was much higher in the on-target patients, while the off-target patients had very low or no epithelial fraction (Fig. 1B), consistent with the absence of CNAs in the off-target samples in the original study [20]. The 6 LuAd samples in our cfNano study had a similarly high epithelial fraction (Fig. 1B), which was significantly higher than in the healthy patients (p = 0.004). In all WGBS and cfNano samples, full-depth results were highly similar to 0.2× downsampled results ( Fig. 1C and Additional file 1: Figs. S1-S3). Interestingly, while the Nguyen et al. study interpreted the normal-like methylation levels of the "off-target" tumors as a difference in cancer methylation patterns, our deconvolution results strongly suggest that it is due to the relatively low amount of cancer DNA circulating in the blood.
The fraction of cancer cells in cfDNA ("tumor fraction") can be estimated from somatic copy number alterations (CNAs) using the ichorCNA tool [21], for cancer cells that contain a sufficient degree of aneuploidy. We estimated tumor fraction for our cfNano samples and four matched Illumina WGS samples from LuAd patients (Fig. 1D, Additional file 2: Table S1, and Additional file 5). While the Illumina samples were sequenced at significantly higher depth (median 1.3×), the tumor fraction estimates were highly similar between cfNano and Illumina sequencing (Fig. 1E). Interestingly, the ichorCNA tumor fractions were more similar to the high-depth Illumina samples than the Illumina samples were to themselves when downsampled to the same depth as the cfNano samples (Additional file 1: Fig. S5A).
To compare ichorCNA tumor fraction estimates to methylation-based estimates, we designed a "two-component" deconvolution method based on NNLS regression that used 2253 CpGs with differential methylation between sorted lung epithelia and healthy plasma. This was based on the same array-based MethAtlas samples from [5] as the full deconvolution (Fig. 1F). Three hundred thirty to 1526 of these CpGs were covered by each cfNano sample (Additional file 2: Table S1), which were the CpGs used for NNLS deconvolution. These DNA methylation-based estimates of lung fraction and the ichorCNA estimates of cancer cell fraction were largely in agreement (Fig. 1F, bottom) with all of the six LuAd samples having significantly higher lung fraction compared to the seven healthy plasma samples (p = 0.003). Two LuAd cases were markedly higher in the methylation-based than the CNA-based estimate (BC09 and BC10). While we have no independent data to determine which was the more accurate estimate, we hypothesize that the discrepancy may be due to either wholegenome doubling (WGD) events that are not detected by ichorCNA (WGD occurs in 297/503 or 59% of LuAd tumors from the TCGA project [22]) or damage to normal lung cells surrounding the tumor which die and shed their DNA into circulation [23].
To verify the robustness of methylation-based deconvolution, we used a mutually exclusive set of 13,770 CpGs that could distinguish TCGA LuAd tumors from healthy plasma but were not found in the normal lung epithelia set (Fig. 1G). Before applying the NNLS regression, since most TCGA LuAd samples contain a significant fraction of leukocytes, we corrected the methylation levels of the TCGA LuAd samples based on their non-cancer cell contamination ("purity correction"). After this correction, the tumor fraction estimates of our cfNano samples were highly similar to those based on normal-lung specific CpGs (Fig. 1G, bottom), despite the fact that the two CpG sets were completely non-overlapping. One HU healthy sample (HU005.10) had a higher lung fraction estimate than one of the cancer samples, possibly because this was the cfNano sample with the lowest sequencing coverage (0.11×). However, the methylation-based tumor fraction was still significantly higher in the LuAd samples than in healthy controls (p = 0.004).
We performed all deconvolution analyses using a second, and older, base modification caller (DeepSignal [24]). While Megalodon called 10-20% more CpGs, the majority of CpGs called were in common between the two methods and had identical methylation states (Additional file 1: Fig. S6A). Both the full cell type deconvolution (Additional file 1: Fig. S6B) and the two-component deconvolution (Additional file 1: Fig. S6C) were highly similar between the two callers.

Genomic context of DNA methylation changes detected using cfNano
The deconvolution analysis above was based on unannotated differentially methylated regions. In order to investigate the genomic context of lung cancer-specific DNA methylation, we analyzed one hypomethylation feature associated with cell of origin (lineage-specific transcription factor binding sites) and one associated specifically with transformation (global hypomethylation). For the TFBS analysis, we identified 5974 predicted TFBS that were specific to lung epithelia based on a single-cell ATAC-seq atlas of open chromatin within lung and other primary human tissues [25]. In that study, adult lung tissues from multiple donors contained a strong cluster of lung pneumocytespecific open chromatin regions ("Pal" cluster). This cluster was most strongly enriched for the binding motif for the transcription factor NKX2-1, which is a master regulatory transcription factor in this cell type [26]. NKX2-1 activity is also known to be highly restricted to this cell type [27], and NKX2-1 binding sites were also the most enriched within lung adenocarcinoma ATAC-seq sites in an independent study [28]. Because open chromatin regions are almost universally hypomethylated, we hypothesized that the 5974 predicted NKX2-1 TFBS in lung pneumocytes would be specifically hypomethylated in healthy lung tissues and in lung tumors. We confirmed this using WGBS data from TCGA [29] (Additional file 1: Fig. S7A).
We next plotted plasma cfDNA methylation levels at these same predicted NKX2-1 sites from the published Illumina WGBS studies and our cfNano study ( Fig. 2A). In healthy samples from three WGBS studies and our own cfNano samples, NKX2-1 sites were fully methylated. In contrast, the LuAd samples from both the Nguyen et al. WGBS study ( Fig. 2A, middle) and our cfNano study ( Fig. 2A, right) had substantial demethylation. In both studies, demethylation could only be observed in the higher tumor fraction samples ("on-target" samples in the WGBS study, and samples with ichorCNA TF > 0.15 in the cfNano study). As a negative control, we selected predicted TFBS from a cell type not expected to be found either in healthy plasma or LuAd. We used the adrenal cortical cluster ("Adc" cluster) from [25], which was highly enriched for the KLF5 binding motif.
These sites were fully methylated in plasma samples from both healthy and LuAd individuals (Additional file 1: Fig. S7B). cfNano profiles were nearly identical using DeepSignal methylation calling (Additional file 1: Fig. S7C).
Global DNA hypomethylation is one of the hallmarks of the cancer epigenome. It has long been proposed as a general marker for circulating tumor DNA [31], and this was recently verified for lung cancer using shallow plasma cfDNA WGBS [20]. We have shown that this "global" hypomethylation is not completely global, and occurs preferentially within large domains called partially methylated domains (PMDs) and specifically at CpGs with a local sequence context termed "solo-WCGW" [29]. We replotted WGBS Fig. 2 Genomic context of DNA methylation changes detected using cfNano. A Plasma cfDNA methylation levels were averaged from − 1 to + 1 kb at 5974 pneumocyte-specific NKX2-1 transcription factor binding sites (TFBS) taken from [25]. All methylation values are fold change relative to the flanking region (region from 0.8 to 1 kb from the TFBS). From left to right, plots show 23 healthy plasma samples from [13], 32 healthy plasma samples from [30], 3 healthy and 18 LuAd WGBS samples from [20], and 7 healthy and 6 LuAd cfNano samples from this study. B Average DNA methylation across chr16p, comparing lung tissue WGBS (top) to plasma cfNano samples from this study (bottom). Reference partially methylated domains (PMDs) are taken from [29]. C Methylation delta is shown for all 10-Mbp bins overlapping a reference PMD (methylation delta defined as the average methylation of the bin minus the average methylation genome-wide). Each cancer sample was compared to the group of healthy samples using a one-tailed t-test, and statistical significance is shown using asterisks. D 10-Mbp PMD bins were stratified by copy number status for each cancer sample, and statistically significant differences were calculated by performing one-tailed Wilcoxon tests within each sample.*p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001 methylation data from TCGA normal lung and lung tumor tissues, showing a typical chromosome arm where strong hypomethylation occurs within the PMD regions identified in [29] in the cancer samples (Fig. 2B, top). In our cfNano samples, strong hypomethylation was also found exclusively in the cancer samples in the same PMD regions (Fig. 2B, bottom). To quantify this genome-wide, we plotted the methylation change (relative to the sample-specific whole-genome average) of PMD solo-WCGW CpGs within each 10-Mbp genomic bin that overlapped a common PMD region from [29] (Fig. 2C). As expected, five of the six cancer samples were significantly hypomethylated relative to the healthy controls (p < 0.0001). Overall, there was significantly more hypomethylation across all cancer sample bins (mean = − 0.10, SD = 0.07) than across all healthy sample bins (mean = − 0.05, SD = 0.04), corresponding to a p-value < 2.2E−16 by the onesided Wilcoxon test. In the final LuAd case (BC11), no PMD hypomethylation could be detected (Fig. 2C). This is not surprising given the high variability associated with global hypomethylation in cancer [29], a process that is not entirely understood but is affected both passively, through mitotic divisions, and actively, by dysregulation of several chromatin modifiers [32,33].
Reasoning that copy number altered regions would have skewed proportions of tumor-derived DNA and thus different levels of PMD hypomethylation, we divided the PMD bins based on the copy number status of each sample. In four of the five cases with significant hypomethylation overall, the amplified bins had significantly more hypomethylation than diploid regions (Fig. 2D). In the one remaining case (BC09), there were not enough PMDs with CNAs for an accurate measurement. Conversely, deleted regions showed significantly less hypomethylation than diploid regions, although this trend only reached statistical significance in two cases. In the future, the combined analysis of CNAs and global hypomethylation may provide a stronger cancer-specific signal than each feature alone. PMD hypomethylation profiles were nearly identical using DeepSignal methylation calling (Additional file 1: Fig. S7D-F).

cfNano preserves nucleosome positioning signal
Cell-free DNA circulates primarily as mononucleosomal fragments, and mapping the positions of these mononucleosomes can be used to identify cell type-specific positioning (reviewed in [1]). CTCF binding sites provide a good test of whether these signals are detectable, since they eject a central nucleosome and position 10 phased nucleosomes on either side of their binding site [34] (Fig. 3A). Around a set of 9780 CTCF binding sites, cfNano mononucleosome locations recapitulated this expected pattern (Fig. 3B, top), which was identical to the pattern based on matched Illumina WGS of greater sequence depth (Fig. 3B, bottom). These were also identical when both cfNano and Illumina libraries were downsampled to an equal number of 2M fragments (Fig. 3C). In addition to nucleosome positioning, CTCF binding sites also demethylate all CpGs located approximately 200 bp on either side of the binding site [34], and this DNA pattern was also recapitulated in the methylation data from our cfNano samples (Additional file 1: Fig. S8).
We tried the same mononucleosome mapping approach for the 5974 lung-specific NKX2-1 TFBS from Fig. 2A. We could not detect any mononucleosome positioning signal (data not shown). Lung-specific nucleosome positions would only be present on a fraction of the fragments, so the signal from these fragments may be masked by those from non-lung cell types. But given that the inherent nucleosome positioning information is present (as shown by the CTCF example), more advanced normalization and quantification techniques may reveal these cell type-specific fragments more sensitively in the future.

Cancer-associated fragmentation length features of cfNano vs. Illumina WGS
Specific fragment lengths have been associated with cancer-derived cfDNA fragments (reviewed in [1]), and these have been used as accurate cancer classifiers [15]. Specifically, shorter mononucleosome fragments (< 150 bp) tend to be enriched for cancer-derived fragments [35]. Density plots of fragment length showed that our cfNano cancer samples were enriched in these short mononucleosome fragments relative to healthy controls (Fig. 4A). We used the definition from [35] and [15] to calculate the ratio of short mononucleosomes (100-150 bp) to all mononucleosomes (100-220 bp). The short mononucleosome ratio was significantly higher in the high tumor fraction cancer cases (mean = 0.24, SD =0.03) than in the healthy cases (mean = 0.16, SD = 0.03), which corresponded to a t-test p-value of p = 0.038 (Fig. 4B). We compared these ratios calculated from our cfNano libraries with those calculated from the matched Illumina WGS libraries which were available for four of the six cancer samples, and the two library types were strongly correlated (Fig. 4C). We hypothesized that improvements to Nanopore base calling could improve alignment and adapter trimming, so we also compared base calling done with the real-time Guppy basecaller at the time of sequencing ("2019" version) to the new "high accuracy calling" base   (Fig. 4C). While they have not been studied as extensively as mononucleosomes, ref. [35] also showed that dinucleosomes were significantly shorter in cancer fragments than noncancer fragments. This is clear from the density plots of our cfNano samples (Fig. 4A), so we used the size range suggested by ref. [35] to calculate the ratio between short dinucleosomes (275-325 bp) and all dinucleosomes (275-400 bp). The short dinucleosome ratio showed even more separation between cancer vs. healthy cfNano samples than the mononucleosome ratio. The high tumor fraction cancer cases had significantly higher dinucleosome ratios (mean = 0.62, SD = 0.01) than the healthy cases (mean = 0.30, SD = 0.04), corresponding to a t-test p-value of p = 2E−7 (Fig. 4D). We compared dinucleosome ratios calculated from our cfNano libraries with those calculated from the matched Illumina WGS libraries, and they were nearly perfectly correlated (Fig. 4E). When we looked across all samples, the short mononucleosome ratio was highly correlated with the short dinucleosome ratio (Additional file 1: Fig.  S9D). Interestingly, this correlation held across the healthy samples as well as the cancer samples, indicating that the same underlying mechanism affects circulating cfDNA from both cancer and non-cancer cell types.
One of our cfNano samples used a different (non-barcoded) adapter design method from all other libraries, and this sample was a clear outlier in fragment length (Additional file 1: Fig. S9A-D). This reinforces the caution that should be taken when comparing fragmentomic features across different library designs. We also investigated the effect of sequencing depth on cancer-associated features, by comparing full-depth datasets with datasets created by randomly choosing 2M fragments for each library (Additional file 1: Fig. S9E-H). Sequence depth had almost no effect on either cfNano or Illumina samples down to 2M fragments.

Cancer-associated fragment end features of cfNano vs. Illumina WGS
The four bases immediately flanking cfDNA fragmentation sites have a biased sequence composition that differs between cancer-derived and non-cancer-derived fragments [36,37]. To study this in our cfNano samples, we first plotted the 25 most abundant 4-mer end motifs that were identified in an Illumina-based study of healthy plasma cfDNA [36], using a heatmap to indicate motif frequencies in each of our cfNano and matched Illumina samples (Fig. 4F). There was broad agreement across all samples, although some differences between cfNano and Illumina libraries were clearly noticeable. When we plotted the average Nanopore vs. Illumina frequencies in matched samples for all 256 possible 4-mers, it appeared that the less abundant motifs had slightly higher frequencies in Nanopore, while the more abundant motifs had slightly lower frequencies in Nanopore (Fig. 4G). Nevertheless, the relative frequencies were highly concordant overall (PCC = 0.97). These were slightly more concordant when we used the 2022 "high accuracy" (HAC) base calling compared to the original 2019 base calling (PCC = 0.97 vs. PCC = 0.96). The degree of difference between the two batches of cfNano healthy samples ("ISPRO" sequenced in 2019 and "HU" sequenced in 2022) was less than the difference between cfNano and Illumina (PCC = 0.99, (Fig. 4H)).
Of particular interest is the CCCA end motif, which is typically the most abundant 4-mer in healthy plasma, and its reduction was shown to be a cancer marker in several cancer types, including lung cancer [36,37]. CCCA indeed has the highest frequency across all our cfNano and Illumina WGS samples (Fig. 4F-H), and was significantly lower in our three high tumor fraction cancer samples than the healthy samples (Fig. 4I). However, there was a clear difference between the healthy samples generated in the "HU" and "ISPRO" batches, which we presume to be technical since these two batches behaved similarly with respect to fragment length and methylation features. We therefore only did a direct statistical comparison within the ISPRO batch, and indeed CCCA frequency in high TF tumors (mean = 1.7, SD = 0.06) was significantly lower than in ISPRO healthy samples (mean = 1.9, SD = 0.13), leading to a t-test p-value = 0.007 (Fig. 4I).
Additionally, the relative frequencies of four cfNano cancer samples were not concordant with their matched Illumina WGS libraries (Fig. 4J). We conclude from this that end motifs are particularly sensitive to changes in library strategy and sequencing platform, and caution must be taken when comparing across multiple batches. This is not surprising, given that fragment representation can be skewed by a number of variables during library construction and amplification, as well as sequencing errors and downstream bioinformatic steps such as adapter trimming (in our processing, we also exclude fragment ends that are soft-clipped). End motifs are highly susceptible to these biases, because even a single base pair difference results in a completely different motif. Recent benchmarking has highlighted how error frequencies can differ by sequence context between the Nanopore and Illumina platforms [38]. We found 16 additional motifs that were as significant as CCCA, although none survived FDR adjustment and so will have to be validated in larger studies (Additional file 4: Table S3). Like fragment lengths, end-motif frequencies were not sensitive to down sampling to 2M fragments (Additional file 1: Fig. S10A-D).

Discussion
While the sample size is small, our results suggest that cancer-specific features of DNA methylation, fragmentation, and CNA are broadly concordant between cfNano and Illumina-based WGS and WGBS methods. Downsampling analysis showed that the genomic coverage we targeted with cfNano (minimum of 2.5M aligned reads or 0.2× genome coverage) was sufficient to detect cancer-derived DNA in all samples based on DNA methylation. Cancer-associated fragmentomic features were not detected in all samples, but this is likely due to biological variability rather than sequence depth, based on similar studies using Illumina-based approaches [35], as well as our own results running the fragmentomic analyses on downsampled libraries. Notably, our results suggest that short dinucleosomes could be a more robust cancer marker than short mononucleosomes, although this will need to be validated in larger studies.
While most features agreed between cfNano and Illumina-based datasets, we identified fragment end motifs as one that was especially sensitive to sequencing platform differences. With the small sample size here, it is not possible to determine which of the two platforms provides the truer results. It is tempting to hypothesize that Nanopore provides a more accurate representation of actual fragment frequencies, since it does not include any PCR amplification. On the other hand, Nanopore error rates are higher and may lead to less accurate adapter trimming or a different spectrum of sequencing errors within the end motifs (although we did try to control for this by using the reference genome sequence rather than the read sequence and by filtering out reads with soft-clipped ends). Despite these differences, we were able to detect the best-known cancer-specific end motif (CCCA) when we constrained the analysis to samples from our main cfNano batch ("ISPRO"). We also detected slight differences in end-motif frequencies between our own Illumina WGS and earlier published WGS data, and between our two cfNano batches which were sequenced two years apart. This indicates that endmotif frequencies are susceptible to batch effects in general and should be analyzed cautiously between batches. In the future, proper controls or normalizations should be developed to compare this feature across datasets sequenced at different times. In general, it is important to note that both the library construction kits and the underlying chemistry for Oxford Nanopore sequencing are still evolving and that future versions may continue to cause differences in the representation of fragmentation features such as length and end motifs.
With the small number of metastatic cancer samples used in our cfNano study, it was not possible to define the lower limit of methylation sensitivity. However, the fact that some epithelial/lung content was found when downsampling higher coverage WGBS samples, as well as in our cfNano healthy samples that had the lowest coverage, suggests that specificity needs to be improved at this very low coverage. We believe this could be improved significantly if whole-genome DNA methylation atlases (WGBS or similar) can be generated for individual cell types purified from human tissues. The purified cell type atlas we used here was based on the Illumina HM450k platform, which covers only about 10-15% of cell type-specific methylation markers. Alternatively, cancer markers could come directly from discovery studies that use WGBS to profile cancer plasma cfDNA. Such datasets do exist in the private domain but remain proprietary (e.g., [2]).
Even if deconvolution can be improved bioinformatically, we will almost certainly need higher read coverage for applications that require more sensitive detection. The cfNano samples analyzed here were all from metastatic disease, and all had relatively high tumor fractions (≥ 10%), and this does not represent the situation for cancer screening or detection of minimal residual disease. Thus, the Nanopore platform will need to be able to consistently produce tens of millions of reads to enable those applications. Using cfNano directly to discover new cancer markers would not necessarily require greater coverage but would require large numbers of patient samples.
Several other current limitations of Nanopore should be considered. The cost per base of Nanopore sequencing is currently several-fold higher than Illumina, although the new generation of ONT PromethION sequencers is meant to address this and increase throughput. Single-nucleotide and indel error rates are higher for the ONT platform, which could pose an issue for whole-genome analysis of mutational signatures [39,40], something we do not investigate here but is theoretically possible from cfNano. While Nanopore error rates have improved significantly over the past several years, this is a weak point that should be considered if mutations are a priority.
There are several areas where Nanopore methylation sequencing may provide unique strengths over other methods. Bisulfite-based sequencing leads to DNA fragmentation and degradation and can obscure fragmentation patterns [41]. Additionally, bisulfite sequencing can not distinguish between 5mC and other modifications such as 5hmC, 5fC, and 5CaC, whereas these are all in principle detectable by Nanopore. Longer cfDNA fragments have not been extensively studied due to the limitations of short-read technology, and these could be valuable both for biomarker discovery and the basic biology of cfDNA (as shown recently in [19]).

Conclusions
Despite the current limitations discussed above, the simplicity of native ONT sequencing and the number of features that can be extracted from a single run, combined with the low cost and portability of sequencer, make it an interesting proposition for clinical settings. Fast sample prep and sequencing times can allow a complete methylation analysis from sample preparation to computational classification in as little as 1-3 h, enabling real-time medical applications in cancer [11,12]. Because DNA methylation can differentiate non-cancer cell types as well, Nanopore liquid biopsy could be used to monitor collateral damage to adjacent tissue in cancer [23] or urgent conditions in other areas of medicine such as myocardial infarction, sepsis, and COVID-19 [4][5][6]42].
While the sample size of the current study is too small to determine the limits of sensitivity and specificity, we find that both cell type-specific and cancer-specific methylation features can be reliably detected in most of our samples, as well as copy number alterations and cancer-specific fragmentation features. The results provide confidence for pursuing this approach in larger studies.

ISPRO plasma cfDNA samples, library construction, and sequencing
ISPRO samples, library construction, and sequencing were described in our initial publication of these sequences [17]. The original sample names from that study are listed in Additional file 2: Table S1. Notably, one sample (S1/19_326) was produced using a different library kit (SQK-LSK109 vs. NBD-EXP104+SQK-LSK109 for all other samples). This is the singleplex library kit, which results in shorter adapterligated templates overall (due to the lack of barcodes) and thus responds differently to the equivalent clean-up bead concentration. Also, adapter trimming is performed differently in 19_326 due to the library kit differences. For these reasons, fragmentomic properties are not directly comparable between 19_326 and other samples. We thus analyzed 19_326 separately for all fragmentomic analyses (included in Additional file 1: Figs. S9-S10) but included it in all figures when analyzing the methylation and copy number alterations, where small differences in fragment length are not expected to make a difference. Standard MinKNOW runtime control was used without modification (S1 using distribution version 18, and all others using version 19).

HU plasma cfDNA samples, library construction, and sequencing
HU healthy samples are cfDNA extracted from 4-mL plasma, as originally described for these samples in [13]. The original sample names from that study are listed in Additional file 2: Table S1. Barcoded libraries were created using the NBD-EXP104 and SQK-LSK109 kits as described for the ISPRO samples. They were sequenced on a single flow cell, using standard MinKNOW runtime control (distribution v.21.11.7) without modification.

Megalodon modification mapping to produce mod_mappings.bam files
Demultiplexed Fast5 files from the "2022 high accuracy calling (HAC) base calling and alignment of Nanopore fast5 files" section above were processed using Megalodon v. 2.4.2 with the following command-line parameters "--edge-buffer 0 --mod-min-prob 0 --guppy-params '-d /usr/local/hurcs/guppy/6.0.1/data --barcode_kits EXP-NBD104 --trim_barcodes' --remora-modified-bases dna_r9.4.1_e8 hac 0.0.0 5mc CG 0 --guppyconfig dna_r9.4.1_450bps_hac.cfg. " Internally, Megalodon used Guppy server version 6.0.1+652ffd1 and base calling model r9.4.1_450bps_hac. By default, Megalodon filters out multi-mapping (supplementary) reads and uses the minimap2 "map-ont" mode to filter low-quality mappings. Each individual Fast5 tile was run individually, and the resulting mod_mapping.bam files were merged into a single mod_mappings.bam file using samtools merge (v1.14). Samtools/HTSlib versions before v.1.14 were not able to handle the Mm/Ml modification stages. Because Megalodon reports only the reference sequence in the BAM records, and does not report any base substitutions, these are anonymous BAM files which do not contain any genetic information, and thus contain no personally identifiable information and can be shared publicly. These are the primary files used for all methylation analysis, described in more detail below, and are available from GEO GSE185307 and at Zenodo DOI: 10.5281/zenodo.6642503.

Extracting methylation beta values from Megalodon mod_mapping.bam files
Modification mapping by Megalodon to produce mod_mapping.bam files is described above. To extract (stranded) methylation information from the mod_mapping.bam files, we used modbam2bed (https:// github. com/ epi2me-labs/ modba m2bed) v.0.4.5, specifying a minimum probability threshold of 0.667, and filtering out positions with 0 confident reads using awk. The full command line was "modbam2bed --cpg -t 4 -a 0.333 -b 0.667 | awk '($5>0){print} > out.bed. " All coordinates are in GRCh38 and are 0-based. These files are named "*.5mC.cut0.667.hg38.bed.gz. " Column 11 corresponds to the percent of reads methylated. Modbam2bed does not provide a column for the actual number of reads that this percentage is based on, but it can be calculated from the other columns. readCount=(col5*col10)/1000. We also provide a simple bedgraph with just the methylation fraction (beta) values in files named "*cut0.667.hg38.sorted.bedgraph. gz. " These can be loaded into any genome browser. Both file types are available in GEO accession GSE185307 and at Zenodo DOI: 10.5281/zenodo.6642503.

Mapping of cfNano methylation data to HM450k probes
Using the zero-based stranded bed files from modbam2bed ("5mC.cut0.667.hg38.bed. gz" files), we mapped each CpG covering either the forward or reverse strand of each CpG on the Infinium 450k array. For each modbam2bed stranded column, we first got the readCount as (col5*col10)/1000. We then multiplied the methylation percentage by the read count to get the number of methylated reads. Then, we divided the sum of the methylated read counts for the two strands, by the sum of the total read counts for the two strands, to get the unstranded percent methylation (beta value).

Methylation calling from external WGBS datasets
For Fisher-Fox et al., methylation "beta.gz" files were obtained from GSE186888 and processed as recommended using wgbs_tools (https:// github. com/ nloyf er/ wgbs_ tools) beta2bed function to obtain fraction methylated and read count for each CpG.

ichorCNA analysis
BAM files from the 2022 HAC base calling and alignment step above were used as input. Samtools (Version 1.9) was used to filter BAM alignments, unmapped reads, secondary and supplementary reads, reads with mapping quality less than 20 as in [44], and reads longer than 700 bp. For Illumina alignments, we trimmed all "N" nucleotides from the 3′ ends of fastq data, alignments were performed to GCF_000001405.39_GRCh38. p13 with BWA mem [44], and duplicates were marked using picard MarkDuplicates and removed with samtools; read pairs without the properly-paired flag were removed. Pipelines used for preprocessing and filtering of both Nanopore and Illumina data are available at https:// github. com/ Puput nik/ Fragm entom ics_ Genom Biol. Somatic copy number analysis was performed using the ichorCNA package v.0.3.2 [21].
We used ichorCNA to determine copy number alterations and tumor fraction for each cancer sample. If the percentage of genome covered by CN alterations was less than 15%, then the tumor fraction was determined to be unstable and set to 0. ichorCNA tumor fraction estimates are available in Additional file 2: Table S1, and genomic CNA plots for all samples are available as Additional file 5. The ichorCNA parameters are, "--ploidy c(2) --normal c(0.5) --maxCN 7 --includeHOMD False --estimateNormal True --esti-matePloidy True --estimateScPrevalence True --altFracThreshold 0.001 --rmCentromer-eFlankLength 1000000. "

Two-component cell type methylation deconvolution using healthy lung epithelia
To determine the lung fraction from different datasets, we devised a "two-component" version of the NNLS regression model described above. To compose the atlas of differentially methylated probes in 25 human tissues and cell types, we used the data collected and tissue-specific feature selection method from the MethAtlas package (https:// github. com/ nloyf er/ meth_ atlas) [5]. The script feature_selection_function.m (https:// github. com/ methy lgram marlab/ cfdna-ont/ blob/ main/ decon volut ion_ code/ cell_ type_ probes/ creat ing_ refer ence_ atlas/ featu re_ selec tion_ funct ion.m) was used to select Lung_cell epithelial-specific CpGs. For the Megalodon version, the cutoff was set to select the top 1000 hypermethylated and the top 1000 hypomethylated probes, using the three Lung_cell epithelia samples vs. the four healthy plasma cfDNA samples from [5]. For DeepSignal methylation data, the procedure was the same except we used the top 2000 hypermethylated and top 2000 hypomethylated CpGs, to account for the significantly smaller number of CpGs called in the DeepSignal data (shown in Additional file 1: Fig. S6A). We removed any probe that did not have valid (non-NA) values for 2 or more of the Lung_cell samples and 2 or more of the healthy plasma samples.
For each probe, the 450k beta values were averaged across three lung samples to produce a single Lung-specific beta value X 1 . The same was done for the four plasma cfDNA samples from to yield a healthy cfDNA beta value X 2 . We used the Lawson-Hanson algorithm for non-negative least squares (NNLS) (https:// cran.r-proje ct. org/ web/ packa ges/ nnls) to perform non-negative least squares regression as in [5]. Specifically, we identified non-negative coefficients β 1 and β 2 , representing the fraction of lung cells and normal blood cells in the Nanopore cfDNA mixture, respectively, subject to the constraints argmin β ||Xβ − Y|| 2 and β ≥ 0. Then, a single lung fraction β was determined by having β 1 and β 2 sum to 1, with the equation = β 1 (β 1 +β 2 ) .

Two-component cell type methylation deconvolution using TCGA lung tumors
We downloaded the Infinium 450k beta value files for TCGA lung adenocarcinoma (LUAD) tumors using the ELMER packaged in Bioconductor [45]. We removed any probe that did not have valid (non-NA) values for 2 or more of the LUAD samples and 2 or more of the healthy plasma samples. In order to make this analysis completely independent from the healthy lung epithelia deconvolution analysis, we excluded 488 CpGs that were included in the Megalodon "Two-component cell type methylation deconvolution using healthy lung epithelia" analysis above, and an additional 396 that were included in the DeepSignal analysis. We then performed a t-test to compare the methylation beta values of these LUAD-specific probes to the four plasma cfDNA samples from the MethAtlas paper [5], requiring a Benjamini-Hochberg corrected FDR of < 0.001 and an absolute beta value difference of 0.3 or greater.

Correcting TCGA methylation model for cancer cell purity
NNLS was performed for the TCGA lung tumor deconvolution as described above in "Two-component cell type methylation deconvolution using healthy lung epithelia", with the following adaptation. The deconvolution assumes that each of the reference cell types is a representation of the purified cell type, but this is not the case for bulk TCGA tumors which have a median of leukocyte fraction of 30% [22]. For each probe in each TCGA cancer sample, we corrected for this by solving for the equation where M m is the methylation of the mixture, M c is (unknown) methylation of the cancer cells, M l is the (known) methylation of the leukocytes, and β is the (known) percentage of cancer cells in the mixture. M l was taken as the average of white blood cell samples from the MethAtlas [5], and β was taken as the "tumor purity" estimate based on somatic copy number alterations from the TCGA PanCan Atlas project using the ABSOLUTE program [22], downloaded from the PanCan Atlas website (TCGA_mastercalls.abs_tables_JSedit.fixed.txt, URL https:// gdc. cancer. gov/ about-data/ publi catio ns/ panca natlas). We used the pure cancer cell estimates M c and performed NNLS regression as described above in "Two-component cell type methylation deconvolution using healthy lung epithelia".

DNA methylation in 10-Mbp PMD bins
To generate Fig. 2C, D, GRCh19 segmentation results from our previous CNV analysis [17] were divided into non-overlapping 10-Mb bins. Copy number status of each bin was determined by log2ratio segment mean > 0.10 and < − 0.10 for gain and loss, respectively. For the healthy samples, 10-Mb bins were generated from the whole genome. GRCh38 methylation files were converted to GRCh37 using liftover R package. We selected only the bins overlapping one or more common Partially Methylated Domains (PMDs) from [29]. Within these PMD bins, we took the average of all "solo-WCGW" CpGs overlapping a PMDs, with "solo-WCGW" annotation also from [29]. We calculated the bin average from these CpGs as sum(frac_methylation_each_ CpG)/CpG_count. We then subtracted this bin average from the average of all CpGs in the genome, to get the methylation delta shown in Fig. 2C, D. Common PMD and solo-WCGW annotations were taken from file https:// zwdzwd. s3. amazo naws. com/ pmd/ solo_ WCGW_ inCom monPM Ds_ hg38. bed. gz. Statistical significance between each cancer sample's bins and all pooled healthy sample bins in Fig. 2C was calculated by the one-sided Wilcoxon test (because we decided a priori to look only for hypomethylation in the cancer samples). For copy number analysis in Fig. 2D, each pair of copy number groups was compared using a one-sided Wilcoxon test, to test the hypotheses that diploid should have higher methylation than amplified regions, and deleted regions should have the highest methylation. The files and pipeline used for this analysis (including segmentation results from Martignano et al. [1]) are available at https:// github. com/ Puput nik/ CNV_ Methy lation_ Genome_ Biol_ 2022.

Transcription factor binding site (TFBS) analysis
First, we used HOMER to identify predicted NKX2-1 binding sites (using the HOMER built-in matrix "nkx2.1.motif ") across the GRCh38 genome, and removed any site within the ENCODE blacklist. For normal lung cell analysis, we intersected this list with 6754 ATAC-seq peaks identified in the pneumocyte (PAL) cluster 13 CREs from [46] (downloaded from supplemental table 6 of that paper "Table_S6_ Union_set_of_cCREs.xlsx"). We then selected 5974 peaks that overlapped a predicted NKX2-1 TFBS and centered each on the predicted NXK2-1 TFBS. If multiple TFBS were present in the peak, we took the motif with the highest HOMER log-odds match score. This TFBS set is available as file "nkx2.1.incluster13_distalPeaks_PAL.bed.high-estScoreMotifs.hg38.bed" in at Zenodo DOI: 10.5281/zenodo.6642503. To calculate relative methylation levels, raw methylation levels in each bin were divided by the mean methylation within all bins from − 1000 to − 800 and + 800 to + 1000 across all NKX2-1 sites. For TCGA lung and non-lung samples in Additional file 1: Fig. S7A, we downloaded TCGA WGBS bedgraph files from https:// zwdzwd. github. io/ pmd [29]. We used all WGBS cancer types that were represented by normal tissues in the scATAC-seq atlas, as this was the atlas used to define pneumocyte-specific (PAL) peaks. These TGCA types included LUAD and LUSC (lung tissue from atlas), CRC (transverse colon tissue from atlas), BRCA (breast tissue from atlas), STAD (stomach tissue from atlas), and UCEC (uterus tissue from atlas).

KLF5 transcription factor binding site (TFBS) analysis (Additional file 1: Fig. S7B)
As with NKX.2 above, we used HOMER to identify predicted KLF5 binding sites (using the HOMER built-in matrix "klf5.motif ") across the GRCh38 genome and removed any site within the ENCODE blacklist. As a control, we intersected this list with 9274 ATACseq peaks identified in the cluster 43 CREs from [46] (downloaded from supplemental table 6 of that paper "Table_S6_Union_set_of_cCREs.xlsx"). We then selected 1762 peaks that overlapped a predicted KLF5 TFBS and centered each on the predicted KLF5 TFBS. If multiple TFBS were present in the peak, we took the motif with the highest HOMER log-odds match score. This TFBS set is available as file "klf5.incluster43Distal. txt.highestScoreMotifs.bed" in GEO accession GSE185307.

CTCF nucleosome positioning analysis
We used 9780 evolutionarily conserved CTCF motifs occurring in distal ChIP-seq peaks, which were taken from [34]. Nanopore or Illumina fragments within the size range of 130-155 bp were used for fragment coverage analysis, with reads being extracted from BAMs as described above. These shorter mononucleosomal fragments showed similar nucleosomal patterns but gave higher spatial resolution than 156-180-bp fragments. Deeptools (version 3.5.0) bamCoverage was used with the parameters "--ignoreDuplicates --binSize -bl ENCODE_blacklist -of bedgraph --effectiveGenomeSize 2913022398 --normalizeUsing RPGC. " For Illumina WGS, we used the additional parameter "--extendedReads 145. " The bedgraph was converted to a bigwig file using bigWig-ToBedGraph downloaded from UCSC Genome Browser. This bigwig file was passed to Deeptools computeMatrix with the command line parameters "reference-point --refer-encePoint center -out table.out, " and the table.out file was imported into R to create fragment coverage heatmap.

Fragment length analysis
BAM files from either the 2019 real-time basecalling and alignment, or 2022 HAC base calling and alignment, above, were used as input. Samtools (version 1.9) was used to filter BAM alignments, unmapped reads, secondary and supplementary reads, reads with mapping quality less than 20 as in [44], and reads longer than 700 bp. For Illumina alignments, we trimmed all "N" nucleotides from the 3′ ends of fastq data, alignments were performed to GCF_000001405.39_GRCh38.p13 with BWA mem [44], duplicates were marked using Picard MarkDuplicates and removed with samtools. Pipelines used for preprocessing and filtering of both Nanopore and Illumina data, and analyzed data are available at https:// github. com/ Puput nik/ Fragm entom ics_ Genom Biol. In addition, only reads with barcodes at both ends (obtained using the --require_barcodes_both_ends flag while demultiplexing) were used for fragment length analysis of the multiplexed samples (all except 19_326). Read identifiers of double-barcoded reads are available in the "dou-bleBarcodeIds" file in Zenodo DOI: 10.5281/zenodo.6642503. Reads with soft clipping at either the 5′ or 3′ ends were removed. Fragment length was calculated from the Mini-map2 BAM CIGAR column by summing all counts. Short mononucleosome ratio was calculated as numfrags 100−150bp numfrags 100−220bp (150 bp is the same cutoff for short fragments used in [35]).

End-motif analysis
BAM files from 2019 real-time base calling and alignment or 2022 HAC base calling and alignment above were used as input. Fragments and reads were processed and filtered as in fragment length analysis. For cfNano, we only used read end1 because end2 could occasionally not represent the actual end of the fragment. To avoid biases that would affect end-motif analysis, we also removed reads with any soft clipping at end 1. The first